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Abstract 

Background: Multiple sequence alignment (MSA) is a fundamental analysis method used in bioinformatics and 
many comparative genomic applications. Prior MSA acceleration attempts with reconfigurable computing have 
only addressed the first stage of progressive alignment and consequently exhibit performance limitations 
according to Amdahl's Law. This work is the first known to accelerate the third stage of progressive alignment on 
reconfigurable hardware. 

Results: We reduce subgroups of aligned sequences into discrete profiles before they are pairwise aligned on the 
accelerator. Using an FPGA accelerator, an overall speedup of up to 150 has been demonstrated on a large data 
set when compared to a 2.4 GHz Core2 processor. 

Conclusions: Our parallel algorithm and architecture accelerates large-scale MSA with reconfigurable computing 
and allows researchers to solve the larger problems that confront biologists today. Program source is available 
from http://dna.cs.byu.edu/msa/. 



Background 

Biologists and other researchers use multiple sequence 
alignment (MSA) as a fundamental analysis method to 
find similarities among nucleotide (DNA/RNA) or amino 
acid (protein) sequences. The compute time for an opti- 
mal MSA grows exponentially with respect to the num- 
ber of sequences. Consequently, producing timely results 
on large problems requires more efficient algorithms and 
the use of parallel computing resources. Reconfigurable 
computing hardware, such as Field-Programmable Gate 
Arrays (FPGAs), provides one approach to the accelera- 
tion of biological sequence alignment. Other acceleration 
methods typically encounter scaling problems that arise 
from the overhead of inter-process communication and 
from the lack of parallelism. Reconfigurable computing 
allows a greater scale of parallelism using many fine- 
grained custom processing elements that have a low- 
overhead interconnect. 

The most common algorithm used to solve the MSA 
problem is progressive alignment [1-3]. This algorithm 
consists of three main stages. The first stage compares 
all the sequences with each other producing similarity 
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scores only. Since this stage is easily paralleUzed, it has 
traditionally been the focus of parallelization efforts; 
however, speedup is limited without accelerating the fol- 
lowing stages. The second stage of MSA groups the 
most similar sequences together using the similarity 
scores to form a tree that guides alignment in the next 
stage. Finally, the third stage successively aligns the 
most similar sequences and groups of sequences until 
all the sequences are aligned. Groups of aligned 
sequences are converted into profiles before alignment 
with a pairwise dynamic programming algorithm. A pro- 
file represents the character frequencies for each column 
in an alignment. In Stage 3, traceback information from 
full pairwise alignment is required to align groups of 
sequences. 

Accelerator technology requires moving data from the 
host address space to the accelerator before computation. 
If the computation rate on the accelerator exceeds the 
communication rate with the host, performance will be 
limited. Ideally, the communication rate is at least equal 
to or greater than the computation rate. FPGAs are cap- 
able of handling parallel computations on many small 
integer data types; however, floating-point operations 
require more resources and consequently fewer opera- 
tions fit within the same logic. Reducing complex profiles 
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to a simpler integer form allows greater performance on 
the accelerator by lowering the needed communication 
rate and permitting more processing elements. 

In this work, a new method for accelerating the third 
stage is described that reduces subgroups of aligned 
sequences into discrete profiles before they are pairwise 
aligned on the accelerator. Our pairwise alignment algo- 
rithm [4] produces the required traceback information 
and does not limit the sequence length by the number 
of processing elements (PEs) or by the amount of block 
RAM on the accelerator. Other hardware acceleration 
methods are inadequate for use in the third stage 
because the sequence length is severely limited or only 
similarity scores are computed. 

Alignment quality of the new method is assessed with 
the BRAliBase benchmark RNA alignment database [5] 
that consists of 18,990 RNA alignments and with the 
MDSA data set [6]. Discrete profile alignment is shown 
to have comparable quality to other popular MSA pro- 
grams and an accelerated version of the program 
demonstrates two orders of magnitude speedup. 

Related Work 

Most efforts to accelerate bio-sequence applications with 
hardware have focused solely on database searches and 
have employed a pairwise local comparison algorithm. 
Ramdas and Egan [7] discuss several FPGA-based archi- 
tectures in their survey. Other pairwise comparison 
accelerators have also been described in [8-10]. A few 
methods to accelerate MSA with hardware have been 
demonstrated, but they fail to use all the available parallel 
resources in every stage of MSA; consequently, perfor- 
mance is reduced in some stages with idle processors. 

Without accelerating the most time consuming stages 
of progressive MSA, Amdahl's law [11] limits the overall 
speedup. For example, if the third stage takes 5% of the 
computation time, the overall speedup is limited to about 
20 even if the other stages are infinitely fast. If the time 
in Stage 1 is reduced with faster comparison techniques, 
then the acceleration of Stage 3 becomes more critical. 
Newer programs like MUSCLE [12] and MAFFT [13] use 
a faster alignment-free comparison method; therefore, 
the third stage usually dominates the computation time. 
Even though these newer methods show greater perfor- 
mance, most of the related work has still focused on 
accelerating ClustalW where the first stage dominates 
the run time. 

Multiprocessor-Supercomputer 

Most attempts to accelerate MSA have been on 
shared-memory or distributed-memory systems using a 
coarse-grained parallel approach. Mikhailov et al. [14] 
shows a lOx speedup with 16 processors by parallelizing 
all three stages of ClustalW [3] with OpenMP [15] on a 
shared-memory SGI Origin machine. A notable feature 



of this effort is the parallelization of the guide tree cal- 
culation in the second stage. Deng et al. [16] parallelized 
several stages of MUSCLE [12] to realize a speedup of 
15 on a 16 processor shared-memory machine. Several 
attempts [17-20] have been made to parallelize ClustalW 
on distributed-memory systems using message passing. 
In these cases. Stages 1 and 3 were parallelized with the 
best performance reported by Lin et al. [21] having a 
speedup of 29 on 64 CPUs. Tan et al. [22] achieved a 
speedup of 35 on a hybrid multiprocessor-cluster system 
of 40 nodes with 80 CPUs. In the third stage. Tan's 
method distributes group-to-group alignments to system 
nodes using a method that is based upon guide-tree and 
recursive parallelism. The main contribution comes 
from computing the forward and backward DP scans in 
parallel on processors within a node. The small speedup 
achieved in the third stage, which is under 10 in most 
cases, limits the overall speedup of the progressive 
algorithms. 
Cell BE 

Recently, the Cell Broadband Engine has received 
attention as an accelerator for MSA. Vandierendonck et 
al. [23] have accelerated ClustalW by a factor of 8 when 
compared with a 2.13 GHz Intel Core2 Duo processor 
running a single thread. Stages 1 and 3 were parallelized 
on two Cell BEs by vectorizing DP matrix calculations 
and scheduling independent tasks across the 16 available 
synergistic processing elements. Using a Playstation3, 
Wirawan et al. [24] achieved a peak speedup of 108 on 
the first stage when compared with a 3.0 GHz Pentium 
4. Overall, a speedup of only 13.7 was observed on 1000 
sequences with an average length of 446. However, the 
announcement from IBM to discontinue Cell production 
for technical computing [25] may diminish further inter- 
est in the Cell. 

GPU 

Another popular acceleration technology is the general 
purpose graphics processing unit. Its commodity nature 
has sparked much interest outside of the graphics com- 
munity as an acceleration engine. Liu et al. [26] acceler- 
ated all three stages of ClustalW on the GPU, with the 
parallel portions programmed using CUD A [27]. When 
independent task and guide tree parallelism is low, cells 
of DP matrix calculations are computed in parallel. An 
overall peak speedup of 41.53 was demonstrated on 
1000 sequences of average length 858 with 1 GPU card 
(GeForce GTX 280) when compared with a 3.0 GHz 
Pentium 4. The best speedup obtained in each of the 
three stages is 47.13, 11.08, and 5.9 respectively. Again, 
the small gain in the third stage limits the overall 
speedup. 

FPGA 

Reconfigurable computing approaches accelerate the 
first stage of MSA by computing pairwise alignments 
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with a pipeline of processing elements (PEs). This linear 
systolic array operates with fine-grained parallelism 
along a wavefront of cells in the DP matrix. The Clus- 
talW algorithm does not use the score obtained from a 
pairwise alignment directly. Instead, the number of iden- 
tical characters in an alignment are used to compute the 
fractional identity. Oliver et al. [28] accelerates the first 
stage of ClustalW, but leaves the second and third 
stages for execution on the host processor. Instead of 
actually aligning the sequences, a custom algorithm on 
the accelerator counts the number of identical charac- 
ters during the forward scan without performing trace- 
back. The best overall speedup was 13.3 compared to 
ClustalW running on a 3.0 GHz Pentium 4. For Stage 1, 
a PCI-based accelerator board reaches a peak speedup 
of 50.9 with 92 PEs in a XiUnx XC2V6000. In another 
approach, Lin et al. [21] demonstrated an overall 
speedup of 34.6 using 10 Altera Stratix PEIS30 with a 
total of 3072 PEs. For the first stage, a speedup of 
1697.5 was achieved when compared with a 2.8 GHz 
Xeon. The number of identical characters is deduced 
from the comparison score returned from the accelera- 
tor and the sequence lengths. Even with the impressive 
speedup in the first stage, the overall speedup is still 
limited by the third stage. Greater performance may be 
achieved, however, by accelerating the third stage of 
progressive alignment. 

Methods 

Discrete Profile Alignment 

The third stage of MSA pairwise aligns profiles in a 
similar way to sequences, but it must also work with the 
extra information in profiles. Each position of a profile 
designates a point in continuous profile space with a 
vector of character frequencies (see Figure 1 and 2). 
Profile-based MSA applications typically use floating- 
point numbers or scaled integers to represent these 
character frequencies. The extra size and dimension of 
profiles, in relation to sequences, adds to the complexity 
of alignment. Hence, a reduced representation of pro- 
files that retains as much information as possible simpli- 
fies alignment. By reducing profiles to discrete profiles- 
essentially sequences with an extended alphabet-they 
may be aligned with a simpler, higher-performing, pair- 
wise sequence alignment algorithm. 

The concept of discrete profile space was introduced by 
Eskin [29] with application to DNA motif search, which 
finds relatively short patterns in a subject sequence. For 
instance, when searching for promoter sequences, query 
profiles have a length of about 8-12 positions. In Eskin's 
method, a motif is represented as a small, discrete profile 
that contains the probabilities of finding each nucleotide 
at the respective positions. A similar work by Wang and 
Stormo [30] partitions a four-dimensional continuous 
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Figure 1 Example multiple alignment and derived profile. Each 
position in a profile consists of a vector witli cliaracter frequencies 
for tine corresponding column in a group of aligned sequences, 
(a) Multiple alignment of sequences s,. (b) Profile derived from the 
alignment. 

I J 

profile space into 15 subspaces based upon a supervised 
learning algorithm. Each dimension corresponds to a 
nucleotide frequency = (a, c, g, t} with the constraint S 
^ = 1. Any point falling within a partition is then repre- 
sented by a discrete profile symbol. 

For the application of discrete profile space to MSA, a 
few issues and extensions must be addressed. For exam- 
ple, an additional dimension must be added to profile 
space to accommodate gaps. Also, sample points from 
profile space must be selected for representation with 




Figure 2 Profile space. In three dimensions, profile space is a 

triangle on the plane x + y + z = 1; however, five dimensions are 

required to represent DNA alignments. Points in profile space are 

shown with coordinates and an aligned column example 

(transposed). The corners of profile space represent columns of an 

alignment that contain all the same character. 
J 
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discrete symbols, and substitution costs need to be cal- 
culated between these sample points. Furthermore, a 
reduction method from continuous space to discrete 
symbols must be devised that can operate efficiently on 
genomic-sized profiles. 
Sample Points 

Five dimensions in profile space are required to repre- 
sent profiles that contain nucleotide and gap character 
frequencies. Each position of a profile can be mapped to 
a point that falls on the bounded hyperplane/4 +/c + 
/g + /r + fgap = 1 in Euclidean space where 0 < ^ < 1 
and N = {A, C, G, T, "-"}. To reduce the number of pos- 
sible points, a discrete number of sample points are 
selected from continuous profile space. These sample 
points and a corresponding discrete symbol represent 
nearby points in profile space. 

Given D dimensions, a selection algorithm determines 
sample points by projecting lattice points p in D-space 
onto the profile hyperplane according to the parametric 
equation p^= tp, where t = {1 - l.pi)/D. Let L denote the 
number of lattice points on the interval [0,1] of each axis. 
Lattice points (see Figure 3) are evenly spaced by a dis- 
tance of 1/L in each dimension; however, only points that 
he in a band near the hyperplane are considered. Given 
the sum of lattice point coordinates S = l^p^ the consid- 
ered points fall between (1 - < S < (1 - ;^). Intui- 
tively, these lattice points reside on parallel hyperplanes 
that are a distance of s = \/D/DL from each other (see 
Figure 4). Corners of profile space that consist of all one 
nucleotide are also included as sample points, but the 
point indicating a profile of all gaps is excluded. 

The number of sample points is reduced further by fil- 
tering points that represent less probable nucleotide 




Figure 3 Sample point determination. Sample points are 
determined by projecting lattice points onto the profile plane. 

v J 




X axis 

Figure 4 Planes parallel to profile space. Planes parallel to profile 
space are separated by a distance of s = \/D / DL ■ For this 
example, D = 3 and L = 4. 

V J 



frequencies. Nucleotides from the same group, either 
purine or pyrimidine, have a higher probability of being 
aligned, while those from different groups have a lower 
probability. Substitution tables reflect this probabflity in 
their cost values and influence alignment algorithms 
accordingly. Therefore, sample points with a high fre- 
quency of both purines and pyrimidines are eliminated 
if they meet the condition 

{fr+fj>T,)A[\f,-fj\ <Td) 

where / g {A, G} and j g {C, T}. The first term asserts 
that the combined frequency of purines and pyrimidines 
is above a threshold T^, and the second term asserts 
that both groups have similar frequencies with a differ- 
ence less than T^. Threshold values can be adjusted to 
allow more or less sample points depending on the 
desired number of discrete symbols. Effective starting 
values for the thresholds are = 0.75 and = 0.30. 
Substitution Table 

After sample points in profile space are selected, the 
substitution cost between these representative points 
can be determined. Instead of calculating the cost every 
time sample points are compared during alignment, the 
cost can be computed once and stored in a new sample 
substitution table. The discrete symbols associated with 
each sample point become the indices into the table and 
the codebook for a quantization algorithm. 

Substitution costs between sample points are com- 
puted from the individual nucleotide frequencies and 
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substitution costs. Since a hardware constrained imple- 
mentation of the sample substitution table may only 
have 4 or 8-bit entries, a scaling factor adapts the range 
of computed values to fit within entry size limits. Given 
the nucleotide substitution table 5 of size N x N, 3.n 
array of sample points c, and a scaling factor /?, the sub- 
stitution cost s' between each sample point is deter- 
mined by 



N N 
m=\n=l 



The substitution cost of a gap and a nucleotide is the 
gap extension cost plus one. This prevents a gap in one 
sequence from being followed by a gap in the other 
sequence during pairwise alignment of discrete profiles. 
Reduction 

For the accelerator to sustain maximum performance, the 
host system must supply reduced profiles at the accelera- 
tor s input data rate (see Figure 5). Profiles are reduced 
to discrete profiles to support a simpler, higher-perform- 
ing pairwise alignment algorithm on an accelerator that 
only aligns sequences of symbols. A new quantization 
technique is used for this reduction on the host to reach 
the needed performance. For each continuous profile 
position, the reduction algorithm searches for a nearby 
sample point and then returns the corresponding discrete 
symbol. Finding a nearby point in less time is preferred to 
a nearest neighbor search with greater overhead. Also, 
constraining the search to the profile hyperplane = 1 
allows for some optimization. 

A near neighbor search finds a sample point that is 
close to the given continuous point, but not necessarily 
the closest point. This relaxation of proximity allows the 
search to proceed in deterministic time, and thereby 
keep up with the accelerated pairwise alignment. Search 
begins by scaling and truncating each nucleotide fre- 
quency to form a partially quantized point. Then these 
integral coordinates are used as indices into a lookup 
table R that contains references to nearby sample points. 
The scale factor determines the number of quantization 
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levels for each coordinate and also the size of the 
lookup table. As a result of the search, points in contin- 
uous profile space are mapped to a small set of symbols 
that represent sample points. Not every element of the 
D-dimensional lookup table requires storage since the 
partially quantized points lie within a scaled distance of 
{D - l)s from the profile hyperplane. A ragged array 
with only the needed locations is used to implement the 
lookup table R, 
Example 

An example of discrete profile alignment is presented 
starting with two groups of aligned sequences. Profile 
formation, reduction, and pairwise alignment of the pro- 
files are included in this example. A simplified alphabet 
= {A, C, "-"} is used so that the character frequencies 
correspond with the X, Y and Z axes of a depictable 
three-dimensional profile space. 

Figures 6 and 7 show instances of profile calculation 
and reduction. Each profile position is calculated inde- 
pendently and corresponds with a column of aligned 
sequences. Given two groups of sequences {^i, 52} and 
{53, 54}, continuous profiles are calculated by counting 
the occurrence of characters in each column and divid- 
ing by the number of sequences to produce a vector of 
frequencies {fA>fc>fgap)' Profile reduction proceeds by 
scaling each vector and truncating the values to form 
indices into the three-dimensional reduction table Ra,c, 
gap- In this example, the scale factor is equivalent to the 
size of R in each dimension. Table lookup values, which 
are references to nearby sample points, are used for each 
position of the discrete profiles px,i and ^73^4. Figure 8 
depicts two points in profile space and the nearby sample 
points found by lookup in the reduction table R, Less 
probable sample points are not removed in this instance. 
Figure 9 shows the discrete profile alignment process and 
the final alignment of the original groups. The discrete 
profiles px,i and ^^3,4 are aligned with a pairwise algorithm 
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Figure 6 Example profile calculation and reduction for 
sequences 1 and 2. From the alignment {Si, Sj}, a continuous 
profile is derived and then reduced to form the corresponding 
discrete profile pi,2- 5 is a table of sample points. 
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Figure 7 Example profile calculation and reduction for 
sequences 3 and 4. From the alignment {S3, S4}, a continuous 
profile is derived and then reduced to form the corresponding 
discrete profile ^3,4. 5 is a table of sample points. 



that returns the edit string £1,2,3,4 composed of the opera- 
tions ei e {(Mis)Match, Insert, Delete}. This edit string 
also applies to the groups of sequences {^i, 52} and {53, 54} 
because of the position correspondence between align- 
ments and derived profiles. 

Experimental Setup 

The following components were incorporated into 
MUSCLE [12], an open-source MSA program, to 
demonstrate accelerated large-scale MSA. 

♦ SSE accelerated sequence similarity algorithms for 
the first stage of MSA 

♦ A discrete profile alignment algorithm for the third 
stage of MSA 

♦ An FPGA accelerated pairwise alignment algorithm 
[4] 
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Figure 8 Near neighbors in profile space. Given two profile 
points, nearby sample points and associated symbol codes are 
shown. 
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Figure 9 Example profile alignment. A pairwise alignment 
algorithm treats discrete profiles as sequences. The resulting edit 
operations ^1,2,3,4 indicate the computed alignment between the 
discrete profiles pi,2 and ^3,4, and the corresponding groups of 
sequences {si, Sj} and {S3, S4}. 



Corresponding code in MUSCLE was replaced with 
our highly-parallel code that uses SSE instructions and 
the FPGA accelerator. Discrete profile alignment 
replaced the float-based alignment used in each step of 
progressive alignment. Whenever two continuous pro- 
files are aligned, discrete profiles are first computed and 
then aligned. 

Pairwise sequence alignment is a fundamental sub- 
component of the discrete profile alignment algorithm. 
Sequences are aligned on the FPGA accelerator with a 
space-efficient dynamic programming algorithm and a 
traceback procedure. Given two sequences, the accelera- 
tor returns an edit string that describes an optimal 
alignment. This alignment functionality is incorporated 
into a C library that is called from the host platform. 

MUSCLE uses different methods to calculate sequence 
similarity in the first and second iterations. Both of 
these methods have been optimized with SSE instruc- 
tions. The first method does not compare sequences 
directly, but instead compares two vectors that contain 
k-mer counts from corresponding sequences. These vec- 
tors are compared 16 elements at a time with SSE 
instructions to find the minimum values. The second 
method determines the percent identity of two aligned 
sequences by counting the number of matched symbols. 
To accelerate this method, SSE instructions compare 16 
symbols from each sequence at once and increment cor- 
responding counts in a vector register. 

Two versions of the modified MUSCLE are used for 
analysis. One version (MUDISC) implements our pair- 
wise alignment in software on the host, while the other 
(MUFPGA) accelerates pairwise alignment on the FPGA. 
MUDISC is compared with other popular MSA programs 
such as ClustalW [3], Kalign [31], MAFFT [13], MUSCLE 
[12], and POA [32]. Program version numbers and com- 
mand-line options are shown in Table 1. For those 
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Table 1 Program version numbers and command-line 



options 


Version 


Program 


Options 


1.83 


ClustalW 


-type = dna 


2.04 


Kalign 


-nuc 


6.811 


MAFFT 


-nuc 


3.8.31a 


MUDISC 


-seqtype ma -maxiters 2 


3.8.31a 


MUSCLE 


-seqtype ma -maxiters 2 


2 


POA 


-do_global -do_progressive 



MUDISC and MUSCLE additionally use the option "-termgaps full" on 
BRAIiBase and "-termgaps ext" on MDSA when assessing alignment quality. 
MUDISC, MUFPGA, and MUSCLE use the options "-termgaps full -seqtype dna 
-gapopen 0 -center -300 -maxiters 2" for the performance test. 



programs that support iterations, a maximum of two are 
used. The non-accelerated MSA programs and MUDISC 
execute only on the conventional processor and 
MUFPGA additionally uses the FPGA accelerator. Both 
alignment quality and program performance are 
measured. 

Nucleotide adaptations [6] of the BAliBASE [33] and 
SMART [34] reference alignments are used to compare 
the quality of the MSA programs. BAliBASE alignments 
have been determined to be correct based upon known 
three-dimensional structure. Another assessment of 
alignment quality is obtained with the BRAIiBase bench- 
mark RNA alignment database [5] that consists of 18,990 
RNA alignments. Unaligned versions of the reference 
alignments are realigned to produce test alignments. 
Reference and test alignments are then compared with 
the scoring programs MSCORE and SCIF to produce an 
accuracy metric between 0 and 1. The MSCORE program 
combines several scoring methods, which are described 
in the Results section. 

A performance analysis uses a few large-scale, viral data 
sets that range in average length up to 167,043 nucleo- 
tides or contain up to 12,104 sequences. Overall program 
performance for MUSCLE, MUDISC, and MUFPGA is 
measured by the wall-clock time needed to align a data 
set and includes all three stages of progressive alignment. 
For accurate timing, the host processor's performance 
counters are used. 

The host platform consists of a desktop computer with 
8 GB of DRAM and a 2.4 GHz Intel Core2 Duo proces- 
sor running 64-bit Fedora 13 Linux as the operating sys- 
tem. All benchmark applications execute in a single 
thread and are compiled with gcc using -03 optimiza- 
tion. An 8-lane PCI Express [35] add-in card with a 
Xilinx Virtex-4 FXlOO FPGA provides the hardware 
acceleration for pairwise alignment. Acceleration occurs 
on a pipeline of 256 PEs driven by a 100 MHz clock. 
Each PE requires one block RAM to implement the sub- 
stitution cost 5-j a lookup table. The accelerator 



supports linear gap costs and up to 64 points in profile 
space with 6-bit symbol values. 

Results and Discussion 

Alignment quality with BRAIiBase 2.1 is depicted in 
Figure 10 for MUDISC and several other MSA pro- 
grams. The BRAliScore, which reflects the alignment 
accuracy, is plotted in relation to the average pairwise 
sequence identity (APSI) of the reference alignment. 
Identical sequences have an APSI of 100%. BRAliScore 
is composed of two independent scores and is calculated 
by multiplying the fractional identity (FI) [36] and the 
structure conservation index (SCI) [37]. The FI score is 
based on the fraction of matching characters between 
the test and reference alignment, whereas the SCI is not 
based on the reference alignment, but indicates the 
amount of secondary structure conserved in the 
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Figure 10 Alignment quality on the BRAIiBase data set MUDISC 

(the new method) is compared with several alignment programs on 

a seven (k7) and fifteen (kl5) sequence RNA reference set from 

BRAIiBase 2.1. A higher score indicates better quality and is shown 

in relation to the average pairwise sequence identity (APSI). MUDISC 

uses discrete profile alignment. 
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multiple alignment. A local smoothing of score values is 
applied with the acsplines option in gnuplot and a 
weighting factor of 5e-3. Above 60% APSI, there is little 
difference in the alignment quality between the pro- 
grams; however, MUDISC is one of the top performers 
on data sets below 60% APSI. 

A comparison of alignment quality with the MDSA 
reference sets is reported in Figure 11. The Q score, 
which is equivalent to the sum-of-pairs score (SPS) 
score [38], is shown in relation to the APSI. Unlike the 
FI, the Q score only considers residue pairs correctly 
aligned in the test alignment compared with the refer- 
ence and does not count residue-gap pairs. The acs- 
plines smoothing option is again used, but with a 
weighting factor of le-2. MUDISC is on par with other 
MSA programs down to about 40% APSI and is still 
comparable in accuracy below 40% APSI. 
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Figure 11 Alignment quality on the MDSA data set. MUDISC 
(the new method) is compared with several alignment programs on 
the MDSA data set which contains nucleotide adaptations of the 
BAIiBASE and SMART reference alignments. BAIiBASE includes 
reference sets 1-7. MUDISC uses discrete profile alignment. 
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The average alignment quality of MUSCLE and 
MUDISC is shown in Table 2. A variant of MUDISC 
that uses the nearest neighbor search method is also 
shown. According to the Friedman rank test [39] with 
an adjustment for ties, the difference in quality between 
the near and nearest neighbor search methods is not 
significant. Even though the average scores are very 
similar, the difference between MUSCLE and MUDISC 
is significant with MUSCLE ranking higher on BRAli- 
Base and MUDISC ranking higher on the MDSA data 
set. 

Program run times for MUSCLE, MUDISC, and 
MUFPGA are reported in Table 3. MUFPGA obtains an 
overall speedup of 33 relative to MUSCLE on the Influ- 
enza data set and a speedup of 154 on the HIV data set. 
Run times for MUFPGA on the Corona and Herpes data 
sets are estimated since the accelerator currently only 
supports sequence lengths up to 16 K. This maximum 
length is a function of the FPGA register sizes and the 
amount of memory configured to buffer intermediate 
sequences and is not limited by the method. The esti- 
mated run times for MUFPGA show the benefits of dis- 
crete profile alignment on long sequences when pairwise 
alignment is significantly accelerated. To calculate these 
values, the pairwise alignment time in MUDISC is 
reduced by a factor of 290, which is extrapolated from 
timings on the Influenza and HIV data sets. Pairwise 
alignment in the third stage is accelerated by a factor of 
176 on the Influenza data set and a factor of 283 on the 
HIV data set. Our prior work [4] has characterized the 
pairwise alignment performance of the FPGA accelerator 
and has shown a speedup over 300 relative to a desktop 
computer. 

Figure 12 shows the proportion of time spent in the 
three stages of alignment on the Influenza and HIV data 
sets. The time for each stage includes both iterations. 
SSE acceleration improves the first stage run time with a 
speedup of 31 on the Influenza data set and a speedup of 
79 on the Corona data set. Notice that the proportion of 
time spent in similarity calculations on the Influenza data 
set is greater with more sequences and limits the overall 
speedup. The Stage 3 run time of MUFPGA is faster than 
MUSCLE by a factor of 60 on the Influenza data set and 
by a factor of 198 on the HIV data set. The profile reduc- 
tion rate ranges from 44.1 to 98.1 Mpos/s and the reduc- 
tion time ranges from 5.4 to 11.7% of the pairwise 
alignment time on the accelerator. 

Conclusions 

The discrete profile alignment algorithm presented here 
produces alignments with quality comparable to other 
leading MSA programs and enables the acceleration of 
progressive alignment. A speedup over 150 is demon- 
strated when discrete profile alignment is combined 
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Table 2 Comparison of MUSCLE and MUDISC alignment quality 


Reference Set 


MUSCLE Avg. Score 


P-value, Rank 


MUDISC (near) Avg. Score 


P-value, Rank 


MUDISC (nearest) Avg. Score 


BRAIiBase k7 


0.6846 


1 .42e-4, > 


0.6851 


0.875, > 


0.6839 


BRAIiBase kl5 


0.6914 


1.18e-3, > 


0.6819 


0.285, < 


0.6814 


MDSA BAIiBASE 


0.3125 


9.58e-9, < 


0.3623 


0.806, < 


0.3629 


MDSA SMART 


0.6195 


7.46e-3, < 


0.6295 


0.830, > 


0.6316 



The average quality scores for MUSCLE and MUDISC are shown on four reference sets. BRAIiScore is reported for BRAIiBase, and Q score is reported for MDSA. A 
variant of MUDISC that uses the nearest neighbor search method is also shown. The P-values from a Friedman rank test indicate the difference between two 
adjacent programs, and the relational symbols indicate which program ranked higher. 



Table 3 Alignment program run times and overall 
speedup 

Data Sequences Avg. MUSCLE MUDISC MUFPGA Speedup 
Set Length 



a single core. Since this rate is sufficient to support the 
high-end performance of reconfigurable computing, 
other acceleration methods based on GPUs or SSE 
instructions may also be a viable option. 

Minimizing the time for sequence similarity calcula- 
tions in Stage 1 is also important to achieve significant 
speedup, especially for data sets with large numbers of 
sequences. Using SSE instructions reduces the time for 
sequence similarity calculations by a factor of 30-80. 
Thousands of sequences can be aligned in a few minutes 
when Stage 1 is accelerated with SSE instructions and 
Stage 3 is accelerated with reconfigurable computing. 

Future work includes the extension of discrete profile 
alignment to support proteins that have a larger alpha- 
bet by adding more dimensions to profile space. Since 
proteins have an amino acid alphabet of 20 characters, 
instead of 4 like DNA, an alphabet compression scheme 
would be necessary to reduce the number of characters 
and the corresponding dimensionality of profile space to 
a practical number. Reducing the alphabet to six classes 
based on physico-chemical properties, as done in 
MAFFT, would only require 6 dimensions for the classes 
plus 1 for gaps. 

Another area of investigation is to apply the coarse- 
grained parallelism of cluster supercomputers and the 
fine-grained parallelism of reconfigurable computing to 
multiple sequence alignment. Since this work only uses 
a single core and one accelerator, a cluster with reconfi- 
gurable computing could provide an estimated 20-30x 
speedup beyond this work. 
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